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Abstract. Recently popularized randomized methods for principal component analysis (PCA) 
efficiently and reliably produce nearly optimal accuracy — even on parallel processors — unlike the 
classical (deterministic) alternatives. We adapt one of these randomized methods for use with data 
sets that are too large to be stored in random-access memory (RAM). (The traditional terminology 
is that our procedure works efficiently out-of-core.) We illustrate the performance of the algorithm 
£N| ■ via several numerical examples. For example, we report on the PCA of a data set stored on disk 

that is so large that less than a hundredth of it can fit in our computer's RAM. 
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1. Introduction. Principal component analysis (PCA) is among the most pop- 
ular tools in machine learning, statistics, and data analysis more generally. PCA is 
the basis of many techniques in data mining and information retrieval, including the 
latent semantic analysis of large databases of text and HTML documents described 
in pp. In this paper, we compute PCAs of very large data sets via a randomized 
version of the block Lanczos method, summarized in Section [3] below. The proofs 
in [5] and [T3] show that this method requires only a couple of iterations to produce 
nearly optimal accuracy, with overwhelmingly high probability (the probability is in- 
dependent of the data being analyzed, and is typically 1 — 10~ 15 or greater). The 
randomized algorithm has many advantages, as shown in [5] and |13) ; the present 
article adapts the algorithm for use with data sets that are too large to be stored in 

I/"") ' the random-access memory (RAM) of a typical computer system. 

r^| . Computing a PCA of a data set amounts to constructing a singular value de- 

composition (SVD) that accurately approximates the matrix A containing the data 
being analyzed (possibly after suitably "normalizing" A, say by subtracting from 
each column its mean). That is, if A is m x n, then we must find a positive integer 
k < min(m, n) and construct matrices U, S, and V such that 
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A^UZV T 1 (1.1) 

with U being an m x k matrix whose columns are orthonormal, V being an n x k 
matrix whose columns are orthonormal, and S being a diagonal k x k matrix whose 
entries are all nonnegative. The algorithm summarized in Section [3] below is most 
efficient when k is substantially less than min(?7i, n); in typical real- world applications, 
k <C min(m, n). Most often, the relevant measure of the quality of the approximation 
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in (II. ip is the spectral norm of the discrepancy A — U £ V T ; see, for example, Section[3] 
below. The present article focuses on the spectral norm, though our methods produce 
similar accuracy in the Frobenius/Hilbert-Schmidt norm (see, for example, [5]). 

The procedure of the present article works to minimize the total number of times 
that the algorithm has to access each entry of the matrix A being approximated. A 
related strategy is to minimize the total number of disk seeks and to maximize the 
dimensions of the approximation that can be constructed with a given amount of 
RAM; the algorithm in [5] takes this latter approach. 

In the present paper, the entries of all matrices are real valued; our techniques 
extend trivially to matrices whose entries are complex valued. The remainder of 
the article has the following structure: Section [5] explains the motivation behind the 
algorithm. Section [3] outlines the algorithm. Section @] details the implementation for 
very large matrices. Section [5] quantifies the main factors influencing the running- 
time of the algorithm. Section [B] illustrates the performance of the algorithm via 
several numerical examples. Section [7] applies the algorithm to a data set of interest 
in biochemical imaging. Section |8] draws some conclusions and proposes directions for 
further research. 

2. Informal description of the algorithm. In this section, we provide a brief, 
heuristic description. Section[3]below provides more details on the algorithm described 
intuitively in the present section. 

Suppose that k, m, and n are positive integers with k < m and k < n, and A is 
a real m x n matrix. We will construct an approximation to A such that 

\\A-UXV T \\ 2 ^a k+u (2.1) 

where U is a real m x k matrix whose columns are orthonormal, V is a real n x k 
matrix whose columns are orthonormal, £ is a diagonal real k x k matrix whose entries 
are all nonnegative, || A — U £ V T ||2 is the spectral (Z 2 -operator) norm of A — U £ V T , 
and <7fc + i is the (k+ l)st greatest singular value of A. To do so, we select nonnegative 
integers i and I such that I > k and (i + 2)k < n (for most applications, I = k + 2 and 
i < 2 is sufficient; ||^4 — [/£ U T ||2 will decrease as i and I increase), and then identify 
an orthonormal basis for "most" of the range of A via the following two steps: 

1. Using a random number generator, form a real n x I matrix G whose entries 
are independent and identically distributed Gaussian random variables of zero 
mean and unit variance, and compute the m X ((? + 1)1) matrix 

H= ( AG \ AA T AG \ ... \ (AA T ) i - 1 AG \ (AA T ) l AG ). (2.2) 

2. Using a pivoted Q-R-decomposition, form a real m x ((i + 1)1) matrix Q whose 
columns are orthonormal, such that there exists a real ((i + 1)1) x ((i + 1)1) 
matrix R for which 

H = QR. (2.3) 

(See, for example, Chapter 5 in [4] for details concerning the construction of 

such a matrix Q.) 
Intuitively, the columns of Q in (|2.3[) constitute an orthonormal basis for most of 
the range of A. Moreover, the somewhat simplified algorithm with i — is sufficient 
except when the singular values of A decay slowly; see, for example, [5]. 

Notice that Q may have many fewer columns than A, that is, k may be sub- 
stantially less than n (this is the case for most applications of principal component 
analysis). This is the key to the efficiency of the algorithm. 
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Having identified a good approximation to the range of A, we perform some simple 
linear algebraic manipulations in order to obtain a good approximation to A, via the 
following four steps: 

3. Compute the n X ((j 1 1)Z) product matrix 

T = A T Q. (2.4) 

4. Form an SVD of T, 

T = V±W T , (2.5) 

where V is a real n x ((i + 1)1) matrix whose columns are orthonormal, W 
is a real ((i + 1)0 X ((i + 1)/) matrix whose columns are orthonormal, and £ 
is a real diagonal ((i + 1)1) x ((i + 1)Z) matrix such that Ei i > £2,2 > • • • > 
s ( 4 +i)i-i,(i+i)i-i > ^(i+i)z,(i+i)z > 0. (See, for example, Chapter 8 in @] for 
details concerning the construction of such an SVD.) 

5. Compute the m x ((« + 1)Z) product matrix 

£> = QVF. (2.6) 

6. Retrieve the leftmost m x k block U of t/, the leftmost n x k block V of V", 
and the leftmost uppermost k x k block S of S. 

The matrices U, S, and V obtained via Steps 1-6 above satisfy (|2.1|l ; in fact, they 
satisfy the more detailed bound (|3.1|l described below. 



3. Summary of the algorithm. In this section, we will construct a low-rank 
(say, rank k) approximation U £ V T to any given real matrix A, such that 

||i4 - C/S V T || 2 < < s l{Ckn) 1 /( 2i + l ) + min(l, C/n) a k+1 (3.1) 

with high probability (independent of A) , where m and n are the dimensions of the 
given m x n matrix A, U is a real mx k matrix whose columns are orthonormal, V is 
a real n x k matrix whose columns are orthonormal, S is a real diagonal k x k matrix 
whose entries are all nonnegative, <Jk+i is the (k + l)st greatest singular value of A, 
and C is a constant determining the probability of failure (the probability of failure is 
small when C = 10, negligible when C = 100). In Q3.1[l . i is any nonnegative integer 
such that (i + 2)k < n (for most applications, i = 1 or i = 2 is sufficient; the algorithm 
becomes less efficient as i increases), and \\A— lfEV T \\2 is the spectral (Z 2 -operator) 
norm of A - U S V T , that is, 

IA-UWU- — IM-t[7 T >'lh , ( 3 . 2) 

xe»«:||x|| 2 ^0 ||x|| 2 



Nils 



\ 



X>,) 2 - (3-3) 



To simplify the presentation, we will assume that n < m (if n > m, then the user 
can apply the algorithm to A T ). In this section, we summarize the algorithm; see [5] 
and [13] for an in-depth discussion, including proofs of more detailed variants of (|3.ip . 
The minimal value of the spectral norm ||A — B\\2, minimized over all rank-fc 
matrices B, is a k +i (see, for example, Theorem 2.5.3 in 0]). Hence, (|3.1|) guarantees 
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that the algorithm summarized below produces approximations of nearly optimal 
accuracy. 

To construct a rank-/c approximation to A, we could apply A to about k random 
vectors, in order to identify the part of its range corresponding to the larger singu- 
lar values. To help suppress the smaller singular values, we apply A (A A) 1 , too. 
Once we have identified "most" of the range of A, we perform some linear-algebraic 
manipulations in order to recover an approximation satisfying (|3.1|) . 

A numerically stable realization of the scheme outlined in the preceding paragraph 
is the following. We choose an integer I > k such that (i + 1)1 < n — k (it is generally 
sufficient to choose I — k + 2; increasing I can improve the accuracy marginally, but 
increases computational costs), and make the following six steps: 

1. Using a random number generator, form a real n x I matrix G whose entries 
are independent and identically distributed Gaussian random variables of 
zero mean and unit variance, and compute the m X I matrices H^°\ H 1 - 1 ' , 
. . . , H^ 1 ', iifW defined via the formulae 

HW=AG, (3.4) 

#« = A(A T H^), (3.5) 

H^ = A(A T H^), (3.6) 



H® =A(A T H ( - i - 1 '>). (3.7) 

Form the mx((i + l)l) matrix 

H = ( F(°) | #W | ... | H^-V | ffW ) . (3.8) 

2. Using a pivoted (5_R-decomposition, form a real m x ((i + l)l) matrix Q whose 
columns are orthonormal, such that there exists a real ((i + 1)1) x ((i + 1)1) 
matrix R for which 

H = QR. (3.9) 

(See, for example, Chapter 5 in [1] for details concerning the construction of 
such a matrix Q.) 

3. Compute the n X ((i + 1)1) product matrix 

T = A T Q. (3.10) 

4. Form an SVD of T, 

T = VtW T , (3.11) 

where V is a real n X ((i + 1)1) matrix whose columns are orthonormal, W 
is a real ((i + 1)0 X ((i + 1)1) matrix whose columns are orthonormal, and E 
is a real diagonal ((i + 1)1) x ((i + 1)1) matrix such that Si i > £2,2 ^ " ' ^ 
E(i+i)l_i,(i+i)j_i > £(i+i);,(i+i); > 0. (See, for example, Chapter 8 in @] for 
details concerning the construction of such an SVD.) 
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5. Compute the m x ((i + 1)1) product matrix 

U = QW. (3.12) 

6. Retrieve the leftmost m x k block U of U, the leftmost n x k block V of V, 
and the leftmost uppermost k x k block £ of £. The product (7Sy T then 
approximates A as in (|3.1[) (we omit the proof; see [5] for proofs of similar, 
more general bounds). 

Remark 3.1. In the present paper, we assume that the user specifies the rank k 
of the approximation U T, V being constructed. See [5] for techniques for determining 
the rank k adaptively, such that the accuracy ||^4— U £ V T H2 satisfying (|3.1j) also meets 
a user-specified threshold. 

Remark 3.2. Variants of the fast Fourier transform (FFT) permit additional 
accelerations; see [5], [7J, and [14]. However, these accelerations have negligible ef- 
fect on the algorithm running out-of-core. For out-of-core computations, the simpler 
techniques of the present paper are preferable. 

Remark 3.3. The algorithm described in the present section can underflow or 
overflow when the range of the floating-point exponent is inadequate for representing 
simultaneously both the spectral norm \\A\\ 2 and its (2i + l)st power (H^l^) 21-1 " 1 . A 
convenient alternative is the algorithm described in [8]; another solution is to process 
A/||A|| 2 rather than A. 

4. Out-of-core computations. With suitably large matrices, some steps in 
Section [3] above require either storage on disk, or on-the-fly computations obviating 
the need for storing all the entries of the mxn matrix A being approximated. Conve- 
niently, Steps 2, 4, 5, and 6 involve only matrices having 0((i + l) I (m+n)) entries; we 
perform these steps using only storage in random-access memory (RAM). However, 
Steps 1 and 3 involve A, which has mn entries; we perform Steps 1 and 3 differently 
depending on how A is provided, as detailed below in Subsections 14.11 and 14.21 

4.1. Computations with on-the-fly evaluation of matrix entries. If A 

does not fit in memory, but we have access to a computational routine that can 
evaluate each entry (or row or column) of A individually, then obviously we can 
perform Steps 1 and 3 using only storage in RAM. Every time we evaluate an entry 
(or row or column) of A in order to compute part of a matrix product involving A or 
A T , we immediately perform all computations associated with this particular entry 
(or row or column) that contribute to the matrix product. 

4.2. Computations with storage on disk. If A does not fit in memory, but is 
provided as a file on disk, then Steps 1 and 3 require access to the disk. We assume for 
definiteness that A is provided in row- major format on disk (if A is provided in column- 
major format, then we apply the algorithm to A T instead). To construct the matrix 
product in (|3.4[) , we retrieve as many rows of A from disk as will fit in memory, form 
their inner products with the appropriate columns of G, store the results in H^ ' , and 
then repeat with the remaining rows of A. To construct the matrix product in (|3.10p . 
we initialize all entries of T to zeros, retrieve as many rows of A from disk as will fit in 
memory, add to T the transposes of these rows, weighted by the appropriate entries of 
Q, and then repeat with the remaining rows of A. We construct the matrix product 
in (|3.5|) similarly, forming F — A T H^ first, and H^> — AF second. Constructing 
the matrix products in (I3.6I) - (|3.7I) is analogous. 
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5. Computational costs. In this section, we tabulate the computational costs 
of the algorithm described in Section^ for the particular out-of-core implementations 
described in Subsections 14.11 and 14.21 We will be using the notation from Section 
including the integers i, k, I, m, and n, and the m x n matrix A. 

Remark 5.1. For most applications, i < 2 suffices. In contrast, the classi- 
cal Lanczos algorithm generally requires many iterations in order to yield adequate 
accuracy, making the computational costs of the classical algorithm prohibitive for 
out-of-core (or parallel) computations (see, for example, Chapter 9 in [3]). 

5.1. Costs with on-the-fly evaluation of matrix entries. We denote by 
Ca the number of floating-point operations (flops) required to evaluate all nonzero 
entries in A. We denote by Na the number of nonzero entries in A. With on-thc-fly 
evaluation of the entries of A, the six steps of the algorithm described in Section [3] 
have the following costs: 

1. Forming H^ in ()3.4|) costs Ca + 0(1 Na) flops. Forming any of the matrix 
products in (|3~5l- (pT7|) costs 2C A + 0(lN A ) flops. Forming H in (|3T5|) costs 
0(ilm) flops. All together, Step 1 costs (2i + 1) C A + 0(il(m + N A )) flops. 

2. Forming Q in (J3.9I) costs 0(i 2 l 2 m) flops. 

3. Forming T in (fXTOj) costs C A + 0(il N A ) flops. 

4. Forming the SVD of T in (f3TT|) costs 0(i 2 l 2 n) flops. 

5. Forming U in (|3.12j) costs 0{i 2 l 2 m) flops. 

6. Forming U, S, and V in Step 6 costs 0(k(m + n)) flops. 

Summing up the costs for the six steps above, and using the fact that k < I < 
n < m, we see that the full algorithm requires 

Qm-thc-fly = 2(* + 1) C A + 0{il N A + i 2 l 2 m) (5.1) 

flops, where Ca is the number of flops required to evaluate all nonzero entries in A, 
and Na is the number of nonzero entries in A. In practice, we choose I fa k (usually 
a good choice is I = k + 2). 

5.2. Costs with storage on disk. We denote by j the number of floating-point 
words of random-access memory (RAM) available to the algorithm. With A stored 
on disk, the six steps of the algorithm described in Section [3] have the following costs 
(assuming for convenience that j > 2 (i + 1)1 (m + n)): 

1. Forming _ff' ' in (|3.4[) requires at most 0(lmn) floating-point operations 
(flops), (D(mn/j) disk accesses/seeks, and a total data transfer of 0(mn) 
floating-point words. Forming any of the matrix products in (|3.5p ~ (l3.7p also 
requires 0(lmn) flops, 0(mn/j) disk accesses/seeks, and a total data transfer 
of (D(mn) floating-point words. Forming H in (|3.8[) costs 0(ilm) flops. All 
together, Step 1 requires O(ilmn) flops, 0(imn/j) disk accesses/seeks, and 
a total data transfer of 0(imn) floating-point words. 

2. Forming Q in (|3"l?f costs 0(i 2 l 2 m) flops. 

3. Forming T in (|3.10[) requires 0(ilmn) floating-point operations, 0(mn/j) 
disk accesses/seeks, and a total data transfer of 0(mn) floating-point words. 

4. Forming the SVD of T in (|3~TT|) costs 0(i 2 l 2 n) flops. 

5. Forming U in (I3TT2"]) costs 0(i 2 l 2 m) flops. 

6. Forming U, S, and V in Step 6 costs 0(k(m + n)) flops. 

In practice, we choose / ~ k (usually a good choice is I = k + 2). Summing up 
the costs for the six steps above, and using the fact that k < I < n < m, we see that 
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the full algorithm requires 



C flops = 0{ilmn + i 2 l 2 m) (5.2) 



flops, 



Cacccsscs = 0(imn/j) (5.3) 

disk accesses/seeks (where j is the number of floating-point words of RAM available 
to the algorithm), and a total data transfer of 

Cwords = 0(imn) (5.4) 

floating-point words (more specifically, C wor ds ~ 2(i + \)mn). 

6. Numerical examples. In this section, we describe the results of several 
numerical tests of the algorithm of the present paper. 

We set I — fc + 2 for all examples, setting i — 3 for the first two examples, and i = 1 
for the last two, where i, k, and I are the parameters from Section [3] above. We ran 
all examples on a laptop with 1.5 GB of random- access memory (RAM), connected 
to an external hard drive via USB 2.0. The processor was a single-core 32-bit 2-GHz 
Intel Pentium M, with 2 MB of L2 cache. We ran all examples in Matlab 7.4.0, stor- 
ing floating-point numbers in RAM using IEEE standard double-precision variables 
(requiring 8 bytes per real number), and on disk using IEEE standard single-precision 
variables (requiring 4 bytes per real number). 

All our numerical experiments indicate that the quality and distribution of the 
pseudorandom numbers have little effect on the accuracy of the algorithm of the 
present paper. We used Matlab's built-in pseudorandom number generator for all 
results reported below. 

6.1. Synthetic data. In this subsection, we illustrate the performance of the 
algorithm with the principal component analysis of three examples, including a com- 
putational simulation. 

For the first example, we apply the algorithm to the m x n matrix 

A = ESF, (6.1) 

where E and F are m x m and n x n unitary discrete cosine transforms of the second 
type (DCT-II), and S is an m x n matrix whose entries are zero off the main diagonal, 
with 

f 10-4(J-i)/i9, j = 1,2,..., 19, or 20 

tij ' j ~ { 10- 4 /(j - 20) 1 / 10 , j = 21, 22, . . . ,n - 1, or n. l ° ' 

Clearly, S% t i, ^2,2, • ■ • , S n -i. n -i, S n ,n are the singular values of A. 

For the second example, we apply the algorithm to the m x n matrix 

A = ESF, (6.3) 

where E and F are mxm and n x n unitary discrete cosine transforms of the second 
type (DCT-II), and S is an m x n matrix whose entries are zero off the main diagonal, 
with 



(6.4) 
1, or n. 





' 1.00, 


.7 = 1,2, or 3 




0.67, 


j = 4, 5, or 6 


Sj,j — < 


0.34, 


3 = 7,8, or 9 




0.01, 


j = 10,11, or 12 




0.01- "^ 

^ n— 


13) J = 13, 14,... ,71 
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Table 1a 
On-disk storage of the first example. 

in n k t gcn tpcA £0 £ 



2E5 


2E5 


16 


2.7E4 


6.6E4 


4.3E-4 


4.3E-4 


2E5 


2E5 


20 


2. TEA 


6.6E4 


1.0E-4 


1.0E-4 


2E5 


2E5 


21 


2.7E4 


6.9E4 


1.0E-4 


1.0E-4 



Table 1b 
On-the-fly generation of the first example. 

m n k tpcA £o £ 

2E5 2E5 16 7.7E1 4.3E-4 4.3E-4 

2E5 2E5 20 1.0E2 1.0E-4 1.0E-4 

2E5 2E5 24 1.3E2 1.0E-4 1.0E-4 



Clearly, Si t i, 52,2, • ■ • , Sn-i.n-i, S n . n are the singular values of A. 

Table la summarizes results of applying the algorithm to the first example, storing 
on disk the matrix being approximated. Table lb summarizes results of applying the 
algorithm to the first example, generating on-the-fly the columns of the matrix being 
approximated. 

Table 2a summarizes results of applying the algorithm to the second example, 
storing on disk the matrix being approximated. Table 2b summarizes results of ap- 
plying the algorithm to the second example, generating on-the-fly the columns of the 
matrix being approximated. 

The following list describes the headings of the tables: 

• m is the number of rows in the matrix A being approximated. 

• n is the number of columns in the matrix A being approximated. 

• k is the parameter from Section above; fc is the rank of the approximation 
being constructed. 

• £ gen is the time in seconds required to generate and store on disk the matrix 
A being approximated. 

• ipcA is the time in seconds required to compute the rank-fc approximation 
(the PCA) provided by the algorithm of the present paper. 

• £0 is the spectral norm of the difference between the matrix A being approx- 
imated and its best rank-fc approximation. 

• e is an estimate of the spectral norm of the difference between the matrix A 
being approximated and the rank-fc approximation produced by the algorithm 
of the present paper. The estimate e of the error is accurate to within a 
factor of two with extraordinarily high probability; the expected accuracy of 
the estimate e of the error is about 10%, relative to the best possible error e 
(see [5] ) . The appendix below details the construction of the estimate e of the 
spectral norm of D = A— UY.V T , where A is the matrix being approximated, 
and WSV T is the rank-fc approximation produced by the algorithm of the 
present paper. 

For the third example, we apply the algorithm with k — 3 to an m x 1000 
matrix whose rows are independent and identically distributed (i.i.d.) realizations of 
the random vector 

awi + (3iV2+jw 3 + S, (6.5) 

where Wi, u>2, and W3 are orthonormal 1 x 1000 vectors, 5 is a 1 x 1000 vector whose 
entries are i.i.d. Gaussian random variables of mean zero and standard deviation 
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Table 2a 
On-disk storage of the second example. 

m n k t gon ipcA £0 £ 



2E5 2E5 


12 


2.7E4 6.3E4 


1.0E-2 1.0E-2 


2E5 2E4 


12 


1.9E3 6.1E3 


1.0E-2 1.0E-2 


5E5 8E4 


12 


2.2E4 6.5E4 
Table 2b 


1.0E-2 1.0E-2 


On-the-fly generation of the second example. 


m 


n 


k tpcA 


£0 £ 


2E5 


2E5 


12 5.5E1 1.0E-2 1.0E-2 


2E5 


2E4 


12 2.7E1 1.0E-2 1.0E-2 


5E5 


8E4 


12 7.9E1 1.0E-2 1.0E-2 



0.1, and (a, /3, 7) is drawn at random from inside an ellipsoid with axes of lengths 
a = 1.5, 6=1, and c = 0.5, specifically, 

a — ar (cos ip) sin#, (6-6) 

fi = br (sin ip) sin8, (6.7) 

7 = cr cos 9, (6.8) 

with r drawn uniformly at random from [0, 1], tp drawn uniformly at random from 
[0, 2tt], and 9 drawn uniformly at random from [0,tt]. We obtained w\, 102, and W3 by 
applying the Gram-Schmidt process to three vectors whose entries were i.i.d. centered 
Gaussian random variables; W\, W2, and W3 are exactly the same in every row, whereas 
the realizations of a, /3, 7, and S in the various rows are independent. We generated all 
the random numbers on-the-fly using a high-quality pseudorandom number generator; 
whenever we had to regenerate exactly the same matrix (as the algorithm requires 
with i > 0), we restarted the pseudorandom number generator with the original seed. 
Figure la plots the inner product (i.e., correlation) of w\ in (|6.5|) and the (nor- 
malized) right singular vector associated with the greatest singular value produced 
by the algorithm of the present article. Figure la also plots the inner product of W2 
in (|6.5[) and the (normalized) right singular vector associated with the second greatest 
singular value, as well as the inner product of W3 and the (normalized) right singular 
vector associated with the third greatest singular value. Needless to say, the inner 
products (i.e., correlations) all tend to 1, as m increases — as they should. Figure lb 
plots the time required to run the algorithm of the present paper, generating on-the-fly 
the entries of the matrix being processed. The running-time is roughly proportional 
to 771, in accordance with (15. ip . 



6.2. Measured data. In this subsection, we illustrate the performance of the 
algorithm with the principal component analysis of images of faces. 

We apply the algorithm with k — 50 to the 393,216 x 102,042 matrix whose 
columns consist of images from the FERET database of faces described in [10] and [IT] , 
with each image duplicated three times. For each duplicate, we set the values of a 
random choice of 10% of the pixels to numbers chosen uniformly at random from the 
integers 0,1, . . . , 254, 255; all pixel values are integers from 0, 1, ... , 254, 255. Before 
processing with the algorithm of the present article, we "normalized" the matrix by 
subtracting from each column its mean, then dividing the resulting column by its 
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Euclidean norm. The algorithm of the present paper required 12.3 hours to process 
all 150 GB of this data set stored on disk, using the laptop computer with 1.5 GB of 
RAM described earlier (at the beginning of Section [5]) . 

Figure 2a plots the computed singular values. Figure 2b displays the computed 
"eigenfaces" (that is, the left singular vectors) corresponding to the five greatest 
singular values. 

While this example does not directly provide a reasonable means for performing 
face recognition or any other task of image processing, it does indicate that the sheer 
brute force of linear algebra (that is, computing a low-rank approximation) can be 
used directly for processing (or preprocessing) a very large data set. When used alone, 
this kind of brute force is inadequate for face recognition and other tasks of image 
processing; most tasks of image processing can benefit from more specialized methods 
(see, for example, (S], [TU], and QT]). Nonetheless, the ability to compute principal 
component analyses of very large data sets could prove helpful, or at least convenient. 

7. An application. In this section, we apply the algorithm of the present paper 
to a data set of interest in a currently developing imaging modality known as single- 
particle cryo-electron microscopy. For an overview of the field, see [3], [12) . and their 
compilations of references. 

The data set consists of 10,000 two-dimensional images of the (three-dimensional) 
charge density map of the E. coli 50S ribosomal subunit, projected from uniformly 
random orientations, then added to white Gaussian noise whose magnitude is 32 
times larger than the original images', and finally rotated by 0, 1, 2, ... , 358, 359 
degrees. The entire data set thus consists of 3,600,000 images, each 129 pixels wide 
and 129 pixels high; the matrix being processed is 3,600,000 x 129 2 . We set i = 1, 
k = 250, and I = k + 2, where i, k, and I are the parameters from Section |3] above. 
Processing the data set required 5.5 hours on two 2.8 GHz quad-core Intel Xeon x5560 
microprocessors with 48 GB of random-access memory. 

Figure 3a displays the 250 computed singular values. Figure 3b displays the 
computed right singular vectors corresponding to the 25 greatest computed singular 
values. Figure 3c displays several noisy projections, their versions before adding 
the white Gaussian noise, and their denoised versions. Each denoised image is the 
projection of the corresponding noisy image on the computed right singular vectors 
associated with the 150 greatest computed singular values. The denoising is clearly 
satisfactory. 

8. Conclusion. The present article describes techniques for the principal com- 
ponent analysis of data sets that are too large to be stored in random-access memory 
(RAM), and illustrates the performance of the methods on data from various sources, 
including standard test sets, numerical simulations, and physical measurements. Sev- 
eral of our data sets stored on disk were so large that less than a hundredth of any 
of them could fit in our computer's RAM; nevertheless, the scheme always succeeded. 
Theorems, their rigorous proofs, and their numerical validations all demonstrate that 
the algorithm of the present paper produces nearly optimal spectral-norm accuracy. 
Moreover, similar results are available for the Frobenius/Hilbert-Schmidt norm. Fi- 
nally, the core steps of the procedures parallelize easily; with the advent of widespread 
multicore and distributed processing, exciting opportunities for further development 
and deployment abound. 

Appendix. In this appendix, we describe a method for estimating the spectral 
norm ||.D[[2 of a matrix D. This procedure is particularly useful for checking whether 
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an algorithm has produced a good approximation to a matrix (for this purpose, we 
choose D to be the difference between the matrix being approximated and its approx- 
imation). The procedure is a version of the classic power method, and so requires the 
application of D and D T to vectors, but does not use D in any other way. Though 
the method is classical, its probabilistic analysis summarized below was introduced 
fairly recently in [2J and [BJ (see also Section 3.4 of [14)). 

Suppose that m and n are positive integers, and D is a real m X n matrix. We 
define a/ 1 ), uj^ 2 \ a/ 3 ), ... to be real uxl column vectors with independent and 
identically distributed entries, each distributed as a Gaussian random variable of zero 
mean and unit variance. For any positive integers j and k, we define 



/ ll(L> T DV w^Wo 

^(°> = ,SV ii(W--Jl - I 8 - 1 ' 

which is the best estimate of the spectral norm of D produced by j steps of the power 
method, started with k independent random vectors (see, for example, [BJ). Naturally, 
when computing pj.k{D), we do not form D T D explicitly, but instead apply D and 
D T successively to vectors. 

Needless to say, pj t k{D) < \\D\\2 for any positive j and k. A somewhat involved 
analysis shows that the probability that 

P Jt k( D ) > \\D\\ 2 /2 (8.2) 

is greater than 



2n 



/,'/ 2 



The probability in (J8.3P tends to 1 very quickly as j increases. Thus, even for fairly 
small j, the estimate p^k{D) of the value of ||-D||2 is accurate to within a factor of two, 
with very high probability; we used j = 6 for all numerical examples in this paper. We 
used the procedure of this appendix to estimate the spectral norm in (|3.1|) . choosing 
D = A — WSV T , where A, U, E, and V are the matrices from (|3.ip . We set k for 
Pj : k{D) to be equal to the rank of the approximation U E V T being constructed. 
For more information, see [2], [B], or Section 3.4 of [T4"] . 
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Fig. 1a. Convergence for the third example (the computational simulation). 
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Fig. 1b. Timing for the third example (the computational simulation). 
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Fig. 2a. Singular values computed for the fourth example (the database of images). 
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Fig. 2b. Dominant singular vectors computed for the fourth example (the database of images). 
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Fig. 3a. Singular values computed for the E. coli data set. 
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Fig. 3b. Dominant singular vectors computed for the E. coli data set. 
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Fig. 3c. Noisy, clean, and denoised images for the E. coli data set. 
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